Journal of Pathology Informatics — Latest Matching Preprints

1

An Agentic, No Code Artificial Intelligence Workflow for Developing and Externally Validating a Thyroid Nodule Ultrasound Malignancy Classifier

Thomas, J.; Pozdeyev, N.

2026-06-26 endocrinology 10.64898/2026.06.23.26356395 medRxiv

Top 0.1%

12.9%

Convolutional neural networks (CNNs) can classify thyroid nodules on ultrasound, yet published models are seldom available for independent testing, require machine learning expertise to develop and deploy, and are validated mostly on papillary thyroid carcinoma. Objective. To test whether an autonomous (agentic), no code artificial intelligence (AI) agent can develop a calibrated thyroid-nodule malignancy classifier, and to validate it internally and on an external cohort spanning multiple cancer histologies. Methods. This is a retrospective, computational diagnostic study with prespecified endpoints. A no code agent (Hugging Face ML Intern) autonomously reviewed the data, selected and trained the model, and calibrated probabilities, using the open source TN5000 dataset (3500 training, 500 validation, and 1000 test images). The trained ResNet 18 model was externally validated on 232 nodules from the University of Colorado, including follicular, medullary, oncocytic, and follicular variant of papillary carcinomas. Results. On the internal test set, the agentic AI model achieved an AUROC of 0.94 (95% CI, 0.920 - 0.953), sensitivity of 0.90, and specificity of 0.80. On external validation, the agentic AI model achieved an AUROC of 0.90 (95% CI, 0.850 - 0.936), sensitivity of 0.92, specificity of 0.68, negative predictive value of 0.96, and positive predictive value of 0.52, exceeding the performance of a previously published classifier on the same cohort (AUROC of 0.83). Conclusions. An agentic, no code AI workflow produced a calibrated, externally validated thyroid nodule classifier, supporting accessible, reproducible, and independently testable medical AI development. Prospective validation and local recalibration are required before clinical use.

2

Can a Tissue-derived Progression Signature Accurately Predict Colorectal Cancer Stage Transitions in Blood?

Sarkar, P.; Sarkar, P.

2026-06-29 bioinformatics 10.64898/2026.06.23.734006 medRxiv

Top 0.1%

2.0%

Colorectal cancer (CRC) is difficult to track because its molecular changes become increasingly complex as the disease progresses, which complicates robust biomarker discovery. In this study, we developed a machine learning framework that integrates monotonic progression with the StepMiner approach, and we conducted external validation to identify reproducible, consistent transcriptomic biomarkers associated with CRC progression. Gene expression datasets from the publicly available GEO were analyzed across four disease states: normal colon, adenoma, primary colorectal cancer, and metastasis. First, we identified genes with monotonic expression, then used the StepMiner approach to identify genes that act as switches between stages. A balanced 74-gene signature was used for machine-learning classification with a Random Forest. External validation showed strong performance in tissue-based datasets. However, the tissue-derived signature performed poorly on plasma and blood-based datasets, highlighting biological differences between transcriptomic profiles. Cross-filtering between tissue-derived genes and blood expression datasets was then performed, which yielded a 62-gene blood-compatible signature. Leakage-free retraining on GSE164191 achieved a mean AUC of 0.868 with balanced precision. Functional enrichment analysis showed that these genes are highly active in cancer growth. Specifically, the genes CBX3, S100A11, PDK4, NCOR1, and SOX4 demonstrated stable and reliable performance across the validation folds. Overall, our study presents a progression-aware transcriptomic framework for CRC biomarker discovery and demonstrates the importance of external validation. Additionally, we evaluate whether tissue-derived signatures can predict blood profiles. This proposed approach may help the future development of tissue-based diagnostics and minimally invasive liquid-biopsy strategies for CRC.
To ensure reproducibility, our proposed workflow was automated as a Nextflow pipeline. The tissue-derived model was deployed as an application utilizing Angular, ASP.NET Core, and Plumber (R).

3

Clinical Evaluation of Automated Self-Operated Transvaginal Ultrasound for Ovarian Stimulation Monitoring

Shavit, T.; Bortoletto, P.; Szychter, J.; Mendel, S.; Corcos, Y.; Petrozza, J.; Prisant, N.

2026-06-24 sexual and reproductive health 10.64898/2026.06.21.26356181 medRxiv

Top 0.2%

1.5%

Objective To evaluate the feasibility, safety, patient acceptance, and preliminary clinical relevance of automated self-operated transvaginal ultrasound for ovarian stimulation monitoring. Design Prospective observational pilot study. Subjects Ten women undergoing ovarian stimulation for in vitro fertilization or fertility preservation at a single high-volume private IVF center. Exposure Participants performed investigational self-operated transvaginal ultrasound examinations immediately following standard monitoring visits. Patients inserted and stabilized the ultrasound probe while ovarian and endometrial imaging was acquired through controlled motorized probe rotation without real-time anatomical guidance. Main Outcome Measure(s) The primary outcome was feasibility, defined as the generation of evaluable imaging datasets suitable for ovarian stimulation monitoring. Secondary outcomes included bilateral ovarian visualization, procedural safety, patient-reported outcomes, follicular assessment, and agreement of endometrial thickness measurements with standard transvaginal ultrasound. Result(s) Nineteen investigational scan attempts were performed, yielding 18 evaluable datasets (94.7%). Bilateral ovarian visualization was achieved in 16 of 18 evaluable examinations (88.9%), whereas partial ovarian visualization occurred in 2 examinations (11.1%). No adverse events, adverse device effects, vaginal injury, bleeding, or infection were observed. Patient-reported outcomes demonstrated high procedural acceptability, with all participants expressing willingness to reuse the system. Compared with standard transvaginal ultrasound monitoring, investigational self-operated acquisition significantly improved overall examination experience (Wilcoxon p=0.002). Investigational imaging demonstrated clinically relevant agreement with standard transvaginal ultrasound for follicular categorization and endometrial assessment. 
Counts of follicles ≥14 mm correlated strongly with mature oocyte recovery for both investigational and standard ultrasound measurements (Spearman ρ=0.83 and ρ=0.80, respectively). Endometrial thickness measurements also demonstrated strong correlation between modalities (Spearman ρ=0.91). Conclusion(s) This prospective pilot study demonstrates the feasibility of automated self-operated transvaginal ultrasound during ovarian stimulation monitoring. Investigational imaging generated clinically relevant monitoring information without observed safety concerns and was associated with high patient acceptance. These findings support further investigation of patient-operated acquisition strategies and standardized imaging workflows in reproductive medicine.

4

Predicting Chemotherapy Response from Staging Laparoscopy Images

Schnelldorfer, T.; Castro, J.; Goldar-Najafi, A.; Nugent, F. W.; Gaikwad, B.

2026-06-24 oncology 10.64898/2026.06.22.26356226 medRxiv

Top 0.2%

1.1%

Background: For patients with metastatic gastrointestinal cancers, chemotherapy resistance is a common phenomenon that, if known in advance, would allow for individualized treatment decisions. This study aimed to test the feasibility of developing a deep learning computer vision system that uses laparoscopy images depicting peritoneal surface metastases (i.e., capturing the in-vivo optical appearance of metastases as a summary of their molecular makeup) to predict whether a patient is resistant to standard chemotherapy. Methods: The retrospective observational feasibility study included 35 adult patients who underwent staging laparoscopy for non-colon gastrointestinal adenocarcinoma with biopsy-confirmed peritoneal surface metastases and who underwent chemotherapy as their only treatment modality. Chemotherapy resistance was determined based on each patient's observed cancer-specific survival after controlling for confounders. Results: Of 35 patients, 17 were assigned to the chemotherapy sensitive group and 18 to the chemotherapy resistant group. The study cohort provided 1010 laparoscopy image patches of 101 biopsy-confirmed metastases. A densely connected convolutional neural network with cross-validation provided the best results for correctly predicting chemotherapy resistance at the patient level (accuracy 0.80 (95%CI 0.63-0.92), sensitivity 0.72, specificity 0.88, AUC-ROC 0.78). Saliency maps demonstrated the system's trustworthiness. Conclusion: In this study, a prototype surgical computer vision system designed to determine chemotherapy resistance from operative images of peritoneal surface metastases demonstrated its technical feasibility. Further development and validation in a multi-institutional clinical study are pending.

5

Scalable 3D cell-interaction analysis via supercell graphs for prostate cancer risk stratification

Zhao, Y.; Chow, S. S. L.; Yan, R.; Brenes, D.; Serafin, R.; Almagro-Perez, C.; Song, A. H.; Lal, P.; Chan, E.; Downes, M.; Baraznenok, E.; Lopez, J. S.; Madabhush, A.; Mahmood, F.; True, L. D.; Liu, J. T. C.

2026-07-11 pathology 10.64898/2026.07.07.736891 medRxiv

Top 0.2%

1.1%

Cellular interactions underlie fundamental biological processes but are not fully represented in conventional 2D histology images. While 3D pathology allows for more-accurate construction of cell-level graphs, machine-learning models are computationally unwieldy and prone to overfitting, especially when dealing with small cohorts. Here, we introduce SCALE3D, a SuperCell graph Analysis framework for LargE 3D pathology datasets. In SCALE3D, spatially adjacent and morphologically similar cells are grouped into functional "supercells." Supercell subtypes are defined via morphology-based clustering and 3D graphs connecting these supercells are used to model their interactions. Validation was performed with 76 radical prostatectomy specimens from patients with known 5-year biochemical recurrence (BCR) outcomes. SCALE3D-derived features achieve higher performance for BCR prediction than established 3D nuclear and glandular morphological features. Combining these complementary features further improves prediction performance. Compared to individual cell-level 3D graphs, SCALE3D maintains comparable prognostic performance with improved noise tolerance while reducing computational times by up to 1,000-fold.

6

Analytical perturbation reveals hidden instability of biological phenotypes

Piorkowska, N. J.; Ostromecki, A.; Franik, G.; Bizon, A.

2026-07-16 endocrinology 10.64898/2026.07.13.26357916 medRxiv

Top 0.2%

1.1%

Background Unsupervised machine learning has become a cornerstone of computational phenotyping across clinical medicine, genomics, imaging, and multi-omics research. However, phenotype discovery relies on a sequence of analytical decisions - including missing-data handling, preprocessing, dimensionality reduction, clustering methodology, and stochastic initialization - that are rarely evaluated collectively. Although clustering stability has been extensively investigated, the robustness of complete analytical workflows remains largely unexplored. Results We developed an Analytical Perturbation Framework that systematically quantifies the robustness of phenotype discovery by perturbing complete unsupervised learning workflows rather than individual clustering algorithms. Using a real-world cohort of 1,286 women with polycystic ovary syndrome (PCOS), we generated 116 valid analytical pipelines comprising alternative preprocessing strategies, missing-data handling methods, dimensionality reduction approaches, clustering algorithms, and random initializations. Agreement between independently generated phenotype solutions was consistently low (median Adjusted Rand Index = 0.079), indicating substantial sensitivity of phenotype discovery to routine analytical decisions. Variance decomposition identified preprocessing as the largest contributor to phenotype instability (22.8%), followed by clustering methodology (14.6%), whereas stochastic initialization explained only 3.1% of the observed variability. At the patient level, most individuals exhibited reproducible phenotype assignments (median Patient Robustness Score = 0.719), although a substantial subgroup showed markedly lower assignment stability. 
Feature perturbation analyses identified follicle-stimulating hormone, anti-thyroglobulin antibodies, anti-thyroid peroxidase antibodies, total testosterone, luteinizing hormone, and androstenedione as the strongest contributors to computational robustness, rather than biological importance. Finally, phenotype solutions demonstrating greater computational robustness also exhibited greater biological coherence during independent validation.

7

Homologous recombination deficiency prediction from whole slide images using label refinement and foundation-model benchmarking in ovarian cancer

Shah, N. A.; Sarwar, M.; Ullah, E.

2026-06-30 pathology 10.64898/2026.06.25.734452 medRxiv

Top 0.2%

1.1%

Background: Homologous recombination deficiency (HRD) is clinically imperative in high-grade serous ovarian carcinoma (HGSOC), particularly because of its association with platinum sensitivity and benefit from poly(ADP-ribose) polymerase inhibitor (PARPi) therapy. However, public datasets rarely contain a complete combination of diagnostic haematoxylin and eosin (H&E) whole-slide images (WSIs), validated clinical HRD assay results, genomic scar scores, BRCA1 promoter methylation data, and treatment-response outcomes. This creates a major barrier for computational pathology studies seeking to develop clinically interpretable models of HRD or PARPi response from routine histology. Objective: We performed an exploratory, leakage-controlled computational pathology benchmarking study to evaluate whether H&E WSIs from TCGA-OV contain a measurable morphology-linked signal associated with research-grade molecular HRD labels, and whether label refinement and pathology foundation-model embeddings alter predictive performance. Methods: We assembled a frozen-primary TCGA-OV WSI cohort comprising 717 tissue-section/biospecimen slides from 316 patients. Diagnostic FFPE DX slides were excluded from model selection because of complete patient overlap with the frozen-primary cohort. Two HRD labels were evaluated: an initial mutation-only molecular label based on BRCA/HR-gene mutation evidence, and a refined methylation-enhanced molecular label that additionally incorporated BRCA1 promoter methylation. Feature extraction was performed using ResNet50, UNI, CONCH, Virchow2, Phikon-v2, and UNI2-h encoders. Patient-level attention-based multiple instance learning (ABMIL) was used with patient-as-bag modelling. Evaluation used patient-level grouped 5-fold x 5-repeat stratified cross-validation, with 25 folds total, bootstrap confidence intervals, and patient-level leakage control. Results: The initial mutation-only label classified 78 patients as positive and 238 as negative. 
The refined methylation-enhanced label recovered 33 additional positives, resulting in 111 positive and 205 negative patients. Patient-level ABMIL using UNI2-h features achieved the strongest performance for the refined label, with AUROC 0.634 (95% CI 0.571-0.698), AUPRC 0.468 (95% CI 0.390-0.562), balanced accuracy 0.597, sensitivity 0.532, specificity 0.663, F1 score 0.494, and Brier score 0.233. The calibrated threshold was 0.512, yielding TN=136, FP=69, FN=52, and TP=59. Comparative models showed lower discrimination, including UNI2-h with the initial label (AUROC 0.628), Phikon-v2 refined (0.582), Virchow2 refined (0.582), CONCH initial (0.587), ResNet50 refined (0.570), and clinical baselines (AUROC 0.54-0.57). Conclusions: TCGA-OV H&E WSIs contain a modest but reproducible morphology-linked signal associated with research-grade molecular HRD status. However, the AUROC around 0.63, absence of clinical HRD assay labels, lack of genomic scar endpoints in the implemented workflow, and absence of PARPi/platinum response targets prevent clinical interpretation. This study should be interpreted as a proof-of-concept benchmarking framework and methodological foundation for future H&E-based predictive modelling in clinically curated PARPi response cohorts.
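As a quick arithmetic check, the threshold-dependent metrics reported for the refined label can be recomputed from the stated confusion-matrix counts (TN=136, FP=69, FN=52, TP=59). The sketch below is illustrative only: the function name `confusion_metrics` is hypothetical and is not taken from the preprint, and precision is included only because F1 depends on it.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true-positive rate (recall)
    specificity = tn / (tn + fp)          # true-negative rate
    precision = tp / (tp + fp)            # positive predictive value
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Counts as stated in the abstract for UNI2-h with the refined label.
metrics = confusion_metrics(tp=59, fp=69, fn=52, tn=136)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, the recomputed sensitivity, specificity, balanced accuracy, and F1 agree with the values quoted in the abstract (0.532, 0.663, 0.597, and 0.494).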

8

Bone Marrow Mesenchymal Stem Cells Therapy for Premature Ovarian Insufficiency: A Systematic Review and Meta-analysis of Preclinical Studies

Plane, J.; Torres, F.; Vera, P.; Vantman, D.; Andrews, B. A.; Asenjo, J. A.; Caviedes, P.; Daza, A.

2026-07-09 cell biology 10.64898/2026.07.02.736116 medRxiv

Top 0.2%

0.9%

Background Premature ovarian insufficiency (POI) affects approximately 1% of women under 40 and is characterized by elevated levels of gonadotropins, reduced estradiol, impaired folliculogenesis, and infertility. Bone marrow-derived mesenchymal stem cell (BM-MSC)-based therapy has emerged as a promising regenerative strategy in preclinical POI models. This systematic review and meta-analysis evaluated BM-MSC-based interventions, including cell transplantation and secretome/extracellular vesicle administration, in animal models of POI. Methods A systematic review and meta-analysis was conducted following PRISMA guidelines. PubMed, Web of Science, Scopus, ScienceDirect, and the Cochrane Library were searched from inception to February 19, 2025. Preclinical studies assessing BM-MSC-based interventions in animal models of POI were included. Results Thirty-four studies comprising 1,357 animals were included. Compared with controls, BM-MSC-based therapy increased serum estradiol (standardized mean difference [SMD] 3.11; 95% confidence interval [CI] 2.38-3.84) and anti-Mullerian hormone (SMD 1.86; 95% CI 1.03-2.69), while reducing follicle-stimulating hormone (SMD -3.54; 95% CI -4.37 to -2.71) and luteinizing hormone (SMD -3.44; 95% CI -5.17 to -1.70). Follicular counts increased across developmental stages, with fewer atretic follicles. Reproductive outcomes improved, including normal estrous cycles (risk ratio [RR] 7.80; 95% CI 3.15-19.34), pregnancy occurrence (RR 3.72; 95% CI 2.14-6.44), and offspring number (SMD 1.57; 95% CI 1.04-2.09). Conclusion BM-MSC-based therapy consistently improved hormonal, follicular, and reproductive outcomes in preclinical POI models. More well-designed, standardized, and adequately controlled studies to confirm these findings are warranted. Systematic review registration: CRD42023449053

9

A modular generalist-specialist AI framework for ROI selection across spatial profiling workflow

Castillo, S. P.; Gautam, T.; Pinao Gonzales, K. B.; Salvatierra, M. E.; Serrano, A.; Ercan, C.; Rodriguez, B. L.; Acosta, P.; Chen, P.; Shokrollahi, Y.; Lau, A.; Kwong, L. N.; Huse, J. T.; Pan, X.; Patient Mosaic Team, ; Solis Soto, L. M.; Yuan, Y.

2026-07-01 pathology 10.64898/2026.06.26.734862 medRxiv

Top 0.3%

0.9%

Selection of regions of interest (ROIs) is often a crucial step in spatial molecular profiling and many pathology tasks, with substantial implications for research reproducibility and biological interpretability. To provide a reproducible and adaptive framework for AI-guided ROI selection, we developed a modular generalist-specialist solution across spatial profiling platforms. In a cohort comprising 55 tumor types from 160 tissue donors profiled using NanoString Digital Spatial Profiling and multiplex immunofluorescence, we first established a protein-profiling reference atlas capturing compartment-specific immune, checkpoint, stromal, and proliferation patterns. We then developed an AI Specialist Task-Oriented Model for ROI Selection (ASTROS) and ran comprehensive benchmarks of specialist-only (ASTROS), generalist-only (PLIP/GFM), and hybrid generalist-specialist strategies, showing that the hybrid strategy provides a balanced tradeoff across slide-level signal preservation, pathologist-reference concordance, within-slide placement consistency, and large-slide computational efficiency. We further demonstrated the feasibility of virtual staining for ROI preview and of modular ROI placement for other spatial omics technologies, namely Visium and Visium HD workflows. Together, these results support our proposed framework for ROI selection, which addresses unmet needs for lower inter-rater variability, better reproducibility, and greater versatility in spatial profiling experiments.

10

Drivers of Diagnostic Variation in a Digital Global Kidney Transplant Reader Study

Hofstraat-Boersma, R.; du Long, R.; Buzzanca, G.; Abiola, A. A.; Albadri, S.; Ali, Z.; Altaleb, A.; Angioi, A.; Banu, S. G.; Barry, M.; Bhalodia, A. R.; Bianco, P.; Broecker, V.; Buelow, R.; Chauveau, B.; Chen, G.; Cheunsuchon, B.; Crisi, G. M.; Daneshvar, S.; Dendooven, A.; Dokouhaki, P.; Drachenberg, C. B.; Farris, A. B.; Ferlicot, S.; Florquin, S.; Fontana, F.; Gibier, J.-B.; Gibson, I. W.; Gujarathi, S.; Hendricks, A. R.; Husain, S.; Islam, J.; Ismail, W.; Jagannathan, G.; Klager, J.; Kozakowski, N.; Krizova, A.; Kurien, A. A.; Kwon, B.; L'Imperio, V.; Ledesma, F. L.; Low, J. P.; Martin, J

2026-07-13 pathology 10.64898/2026.07.09.26357318 medRxiv

Top 0.3%

0.9%

Background Diagnostic interpretation of kidney allograft biopsies using the Banff classification remains variable, but the determinants of this variability are not fully defined. We performed a global, fully digital multi-reader study to identify the principal drivers of disagreement in Banff-based assessment. Methods Thirty-six kidney transplant biopsies were independently scored by 67 renal pathologists on a standardized digital platform. Readers assessed Banff lesions on hematoxylin and eosin, periodic acid Schiff, and Jones' silver stains; final diagnostic categories were assigned using prespecified Banff-based decision rules. Interobserver agreement was quantified with Gwet's agreement coefficient (AC) statistics. Determinants of diagnostic agreement were evaluated using pairwise mixed-effects logistic regression, and reader similarity was examined by principal component analysis (PCA) with post hoc molecular annotation. Results Agreement for final diagnostic categories was moderate (Gwet's AC1, 0.55; 95% CI, 0.47 - 0.63). Lesion-level agreement varied substantially, with lowest agreement for selected threshold-dependent inflammatory or semi-quantitative lesions, including interstitial inflammation in areas of IFTA, peritubular capillaritis, and arteriolar hyalinosis. Diagnostic concordance differed markedly across biopsies, indicating strong case-level heterogeneity. In pairwise models, differences in active inflammatory and vascular lesion scoring were the strongest correlates of diagnostic disagreement; reader experience and geography contributed minimally. Principal component analysis showed reader variation was organized along two dominant axes: a rejection-calling threshold axis linked mainly to tubulointerstitial inflammatory injury, and a T cell-mediated (TCMR/TI) and antibody-mediated/microvascular (AMR/MVI) inflammation-oriented phenotypic classification axis.
Conclusion Interobserver variation in Banff-based kidney transplant biopsy assessment is structured rather than random and driven mainly by how readers threshold and integrate key inflammatory lesion compartments rather than experience or geographic location.

11

Automated Phenotypic Characterization in Rare Hematologic Malignancies Using a Large Language Model-Based Framework

Khan, M. A.; Ayub, U.; Jajja, S. A.; Anjum, M. U.; Warraich, K.; Jain, P.; Oberoi, J. K.; Al Abbas, M.; Sadiq, M. H.; Sarfraz, M. U.; Huang, Z.; Riaz, I. B.; Palmer, J. M.

2026-07-09 health informatics 10.64898/2026.06.26.26356633 medRxiv

Top 0.3%

0.9%

Background. Diagnosis and risk stratification in rare hematologic malignancies such as myeloproliferative neoplasms (MPNs) - polycythemia vera (PV), essential thrombocythemia (ET), and myelofibrosis (MF) - require expert review of longitudinal, heterogeneous clinical records. This process is cognitively demanding, inconsistently applied, and difficult to scale beyond tertiary centers. No automated phenotyping workflow currently exists for hematologic malignancies. Methods. A HIPAA-compliant large language model (LLM) framework for phenotyping MPN was developed to integrate (i) rule-based retrieval of bone marrow biopsy reports, clinical notes, and structured laboratory results from the electronic health record (EHR); (ii) zero-shot extraction of diagnostic and prognostic variables from unstructured text using GPT-4 Turbo; (iii) a clinician-informed source-prioritization algorithm to reconcile conflicting multi-source data; (iv) WHO/ICC-criteria-based diagnostic classification; and (v) NCCN-based risk stratification using the conventional risk model for PV, IPSET-thrombosis for ET, and DIPSS, DIPSS-plus, and MIPSS70/MIPSS70+ v2 for MF. Patients were identified via MPN-related ICD-9/10 codes; cases met 2017 WHO criteria or had a hematologist-documented diagnosis, and controls did not. The cohort was split into a prompt-development set (n = 60) and a held-out test set (n = 450; 75 cases and 75 controls per disease). Ground truth was established by independent dual-clinician chart review with consensus adjudication. LLM performance was evaluated against the ground truth: variable-level extraction using accuracy, F1 score, and Cohen's kappa; patient-level diagnostic classification using sensitivity, specificity, and Cohen's kappa; and prognostic risk stratification (among confirmed cases) using accuracy, weighted F1 score, and quadratic-weighted Cohen's kappa. 
Wilson 95% confidence intervals (CIs) were used for proportions and bootstrap 95% CIs with 500 resamples for F1 scores. Results. The held-out test set included 450 patients (PV: 150; ET: 150; MF: 150) with pathology reports and structured laboratory results, and 172 patients (PV: 52; ET: 55; MF: 65) with clinical notes. From pathology reports, overall variable extraction accuracy and F1 score were 99% (95% CI, 98-100) and 1.00 (0.99-1.00) for PV, 100% (99-100) and 0.99 (0.96-1.00) for ET, and 100% (99-100) and 0.99 (0.97-1.00) for MF. From clinical notes, overall accuracy and F1 score were 96% (91-100) and 0.94 (0.85-1.00) for PV, 100% (100-100) and 1.00 (1.00-1.00) for ET, and 100% (99-100) and 0.98 (0.95-1.00) for MF. Diagnostic sensitivity was 100% (95% CI, 95.1-100.0) for PV, ET, and MF; specificity was 98.7% (92.8-99.8) for PV and 100% (95.1-100.0) for both ET and MF, with Cohen's kappa of 0.99 for PV and 1.00 for ET and MF. Risk stratification accuracy was 100% with weighted F1 score of 1.00 and quadratic-weighted Cohen's kappa of 1.00 across all three diseases. A pre-specified source-ablation analysis showed that pathology reports alone were sufficient for diagnosis (sensitivity 98.7% for PV, 100% for ET, 96.0% for MF; specificity 100% across all three subtypes) but inadequate for prognostication (accuracy 69.3% for PV, 93.3% for ET, 77.3% for MF). Adding clinical notes to pathology reports recovered full prognostic accuracy of 100% across all three diseases. Conclusions. This first-in-class automated framework achieved expert-level performance for MPN diagnosis and risk stratification from real-world EHR data, establishing a foundation for scalable, standardized phenotyping in rare hematologic malignancies. Prospective, multi-site validation is warranted before clinical deployment.
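The Wilson score interval named in the Methods can be reproduced with a few lines of standard-library Python. The sketch below is a generic implementation, not the authors' code. The counts 75/75 and 74/75 are an assumption inferred from the stated "75 cases and 75 controls per disease" and the reported 100% and 98.7% proportions. The helper name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed counts: 75 of 75 cases detected, and 74 of 75 controls correctly excluded.
sens_lo, sens_hi = wilson_interval(75, 75)
spec_lo, spec_hi = wilson_interval(74, 75)
print(f"sensitivity 100.0% CI: {sens_lo * 100:.1f}-{sens_hi * 100:.1f}")
print(f"specificity 98.7% CI: {spec_lo * 100:.1f}-{spec_hi * 100:.1f}")
```

Under these assumed counts the intervals come out as 95.1-100.0 and 92.8-99.8, which agrees with the ranges quoted for diagnostic sensitivity and for PV specificity.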

12

Metabolomic signatures support the diagnostics of peritoneal endometriosis using generalised linear models.

Cecil, A.; Vouk, K.; Novak Pusic, M.; Vogler, A.; Wenzl, R.; Prehn, C.; Adamski, J.; Lanisnik Rizner, T.

2026-07-07 systems biology 10.64898/2026.07.05.736551 medRxiv

Top 0.3%

0.6%

Endometriosis, a common inflammatory gynecological disorder affecting up to 10% of women worldwide, is characterized by the presence of endometrium-like tissue outside the uterus. Current diagnostic methods, such as ultrasound and MRI, effectively detect ovarian and deep endometriosis but fail to detect the more common peritoneal type. Diagnosing peritoneal endometriosis currently necessitates invasive laparoscopy and histological confirmation. Despite numerous efforts, no new reliable biomarkers have successfully transitioned into routine clinical use. This study aimed to investigate the use of targeted metabolomics to discover metabolite ratios capable of identifying endometriosis in plasma samples. We analyzed a discovery population of 235 patients and a validation population of 278 patients. All cases and controls in both populations were diagnosed by laparoscopy. Control subjects included individuals presenting with symptoms such as pain, dysmenorrhea, infertility, or other benign conditions, but who had no laparoscopic evidence of endometriosis. Using generalized linear models (GLMs) and machine learning, the study identified specific metabolite ratios as potential biomarkers that can distinguish different types of endometriosis and enable mass spectrometry-based diagnostics for peritoneal endometriosis. The best-validated GLM, derived from the concentration ratios of amino acids, acylcarnitines, sphingomyelins, and phosphatidylcholines, consisted of Thr/SM(OH) C22:2 + PC aa C40:5/SFA_PC + lysoPC a C16:0/SM(OH) C16:1. This model yielded an AUC of 0.82 (95% CI 0.619-0.891), with 76% sensitivity and 81% specificity, for peritoneal endometriosis. This innovative approach offers a robust diagnostic model, addressing an unmet medical need by facilitating earlier detection of peritoneal endometriosis and improving overall clinical management.

13

No single biological phenotype exists in polycystic ovary syndrome: evidence from cross-space phenotyping

Piorkowska, N. J.; Ostromecki, A.; Franik, G.; Bizon, A.

2026-07-10 endocrinology 10.64898/2026.07.09.26357636 medRxiv

Top 0.3%

0.6%

Context Polyendocrine metabolic ovarian syndrome (PMOS), formerly known as polycystic ovary syndrome (PCOS), is a biologically heterogeneous disorder, yet previous clustering studies have reported inconsistent phenotype structures. Whether these discrepancies reflect methodological variability or genuine multidimensional disease biology remains unknown. Objective To determine whether independently derived endocrine, metabolic, inflammatory, and thyroid phenotypes represent the same underlying biological structure or capture distinct dimensions of PMOS heterogeneity. Design Cross-sectional observational study using a cross-space phenotyping framework. Setting Tertiary referral outpatient endocrinology and gynecology clinic. Participants A total of 1,286 women were diagnosed with PCOS according to the Rotterdam criteria. Methods Four predefined biological spaces (endocrine, metabolic, inflammatory, and thyroid) were analyzed independently. Within each space, standardized preprocessing, dimensionality reduction, and unsupervised clustering were performed. Cluster robustness was evaluated using bootstrap resampling, while agreement between independently derived phenotypes was quantified using the adjusted Rand index (ARI). Biological relevance was assessed using independent non-circular validation with variables excluded from phenotype derivation. Sensitivity analyses compared complete-case and imputed datasets. Results All four biological spaces produced highly stable clustering solutions (bootstrap ARI: endocrine 0.915, metabolic 0.964, inflammatory 0.930, thyroid 0.990). Despite this robustness, agreement between independently derived phenotypes remained consistently low. The highest concordance was observed between metabolic and inflammatory phenotypes (ARI = 0.208), followed by endocrine and metabolic phenotypes (ARI = 0.159), whereas agreement involving thyroid phenotypes was close to zero. 
Independent non-circular validation confirmed that all identified phenotypes represented biologically coherent patient subgroups beyond the variables used for clustering. Sensitivity analyses demonstrated high agreement between complete-case and imputed solutions, supporting the robustness of the findings. Conclusions Stable biological phenotypes exist within individual physiological domains of PMOS but do not converge into a single overarching biological phenotype. These findings support a multidimensional model of PMOS heterogeneity in which endocrine, metabolic, inflammatory, and thyroid systems describe complementary rather than interchangeable aspects of disease biology. Cross-space phenotyping provides a general framework for investigating biological heterogeneity in complex disorders and may facilitate future precision medicine approaches.
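
As a rough illustration of the agreement metric named in this abstract, the adjusted Rand index (ARI) between two clusterings can be computed from their contingency counts. The sketch below is a generic pure-Python implementation, not the authors' code; the function name and the toy label vectors are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # chance-level agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case: both partitions are trivial (one cluster or all singletons).
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example (hypothetical labels): identical structure under different names.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy example: every pair grouped by one partition is split by the other.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, values near zero indicate that two clusterings share little structure even when each one is highly stable on its own.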

14

Seeing Nothing, Saying Something: The Lack of Visual Grounding and Confabulation in Gemini Models for Histopathology

Hasan, M. M.; Tozal, M. E.; Ayhan, M. S.

2026-07-07 health informatics 10.64898/2026.07.04.26357257 medRxiv

Top 0.4%

0.6%

Large vision-language models (VLMs) have demonstrated remarkable performance on computational pathology benchmarks, yet their reliability under adversarial or vacuous inputs remains poorly understood. This paper examines the visual grounding behaviour of two Gemini models, Gemini 3.0 Flash Preview (gemini-flash) and Gemini 3.1 Pro Preview (gemini-pro), on a well-known histopathology classification task, and probes for confabulation using an adversarial blank-image set. On the real histopathology dataset, both models achieve near-perfect accuracy (98.75% - 100%) across three temperatures (0.0, 0.5, 1.0) and three independent runs. On a controlled adversarial set of blank white images labelled as either benign or malignant, however, a stark divergence emerges. Gemini-flash consistently acknowledges the absence of visual content and assigns zero confidence, while Gemini-pro fabricates detailed, clinically plausible histological descriptions and reports high confidence (mean ≈ 0.95) across the same blank inputs, a behaviour we term confident confabulation. The confabulation rate of gemini-pro reaches 77.8% of image-responses at temperature 0.0, dropping to 44.4% at temperature 0.5 and rising to 66.7% at temperature 1.0, while gemini-flash records 0% at all temperatures. These findings raise important questions about the safety and trustworthiness of VLMs in clinical decision-support contexts, and underscore the need for comprehensive evaluation beyond standard accuracy metrics.

15

Deep Transfer Learning for Dormancy and Outbreaking State Classification in Metastatic Breast Tumor Cells: A Benchmark of Modern Deep Learning Models

Sharma, O.; Weidenfeld, K.; Barkan, D.; Gal, O.

2026-06-23 Cancer Biology 10.64898/2026.06.22.733720 medRxiv

Top 0.4%

0.5%

Breast cancer cells that disseminate to distant organs can remain dormant (non-proliferative) for years before reactivating and progressing into lethal metastatic disease. Understanding the transition between dormancy and reactivation is therefore critical for early intervention and treatment. In this study, we investigate a comprehensive range of deep learning (DL) architectures to classify dormant versus proliferative breast tumor cells within a 3-dimensional growth factor reduced basement membrane extract (3D BME) system that models tumor dormancy and outgrowth. To capture the underlying spatiotemporal dynamics, we evaluate both spatial and sequence-based learning approaches. We consider convolutional neural networks (EfficientNet, ResNet, DenseNet, MobileNet, VGG, AlexNet), segmentation-based models (U-Net, U-Net++, Attention U-Net, DeepLabV3, HRNet) and transformer-based architectures (Vision Transformer, Swin Transformer, SegFormer). We investigate transfer learning using both fixed and fine-tuned strategies. Experimental results show that classification performance is greatly enhanced through the integration of temporal information. EfficientNet-B7, EfficientNet-B6, DenseNet-169, and DenseNet-201 consistently outperform the other tested architectures. With temporal sequence input, EfficientNet-B7 reaches an accuracy of 98.86% and a ROC-AUC of 0.998. The results highlight the significance of spatio-temporal feature learning and the value of DL frameworks in automated classification of dormant versus proliferative breast cancer cells in physiologically relevant microenvironments.

16

Automated Segmentation of Prostatic Gold Fiducial Markers for MR-Only Radiotherapy Planning Using Multi-Modal Consensus Deep Learning

Stewart, A. W.; Goodwin, J.; Richardson, M.; Robinson, S. D.; O'Brien, K.; Jin, J.; Barth, M.

2026-06-23 bioinformatics 10.64898/2026.06.18.733061 medRxiv

Top 0.4%

0.3%

Purpose: To develop and evaluate a multi-model consensus deep learning approach for automated gold fiducial marker (FM) segmentation in T1-weighted prostate MRI. Materials and Methods: In this retrospective study, T1-weighted MRI and CT-derived reference standard segmentations were collected from 127 prostate cancer patients (all male; mean age, 70 years ± 7 [standard deviation]; age range, 50-88 years; collected between October 2020 and January 2026) who each had three implanted gold FMs. A 3D U-Net was trained on 93 subjects using four random seeds to produce an ensemble. At inference, marker-class probability maps were averaged across models and the top three connected components selected. Performance was evaluated on 34 temporally held-out subjects (9 tuning, 25 test) using marker-level sensitivity and precision with exact (Clopper-Pearson) 95% confidence intervals (CIs). A model count ablation study was performed. The pipeline was deployed for on-scanner processing on Siemens MRI systems via the OpenRecon framework and as a browser-based application using WebAssembly, executing entirely client-side. Results: The four-model consensus achieved 96% (70 of 73) sensitivity and 95% (70 of 74) precision on 25 test subjects, with 29 of 34 (85%) subjects achieving perfect marker detection. Single models had a mean sensitivity of 84% (SD, 9%), improving to 96% with four-model consensus (SD, <1%). Conclusion: Multi-model consensus deep learning substantially improved FM segmentation reliability over individual models, achieving high sensitivity and precision using only routinely acquired T1-weighted MRI.
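
The exact (Clopper-Pearson) interval mentioned in this abstract can be sketched in a few lines. The version below is a minimal pure-Python illustration that inverts the binomial tail probabilities by bisection; it is not the authors' analysis code, and the helper names are assumptions. The counts passed in at the end are the marker-level figures quoted in the abstract.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=200):
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2 (0 when k == 0).
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2 (1 when k == n).
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Counts taken from the abstract: 70 of 73 (sensitivity), 70 of 74 (precision).
sensitivity_ci = clopper_pearson(70, 73)
precision_ci = clopper_pearson(70, 74)
```

The exact interval is conservative by construction, which is one reason it is often preferred for small marker counts such as these.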

17

Endocrine - metabolic network architecture reveals key bridge biomarkers in polycystic ovary syndrome

Piorkowska, N. J.; Franik, G.; Bizon, A.

2026-07-14 endocrinology 10.64898/2026.07.10.26357756 medRxiv

Top 0.5%

0.3%

Context: Polycystic ovary syndrome (PCOS) is a heterogeneous endocrine disorder involving complex interactions among endocrine, metabolic, inflammatory, and thyroid pathways. However, the systems-level organization of these interactions remains poorly understood. Objective: To reconstruct the endocrine-metabolic biomarker network in women with PCOS and identify bridge biomarkers integrating distinct physiological domains. Design: Retrospective cross-sectional study. Setting: Single tertiary referral center. Participants: A total of 1,286 women diagnosed with PCOS according to the revised Rotterdam criteria. Methods: Twenty-nine routinely measured laboratory biomarkers representing endocrine, metabolic, hematological/inflammatory, and thyroid domains were analyzed. Sparse Gaussian graphical models were estimated using Graphical LASSO with Extended Bayesian Information Criterion model selection. Network topology, node centrality, bridge centrality, bootstrap resampling, and predefined sensitivity analyses were performed. Results: The reconstructed network comprised 29 biomarkers connected by 73 conditional dependency edges (network density, 0.18), demonstrating a modular but highly integrated endocrine-metabolic architecture. Conventional centrality analysis primarily identified biomarkers organizing local physiological modules, whereas bridge-centrality analysis revealed biomarkers coordinating communication between biological domains. Sex hormone-binding globulin exhibited the highest bridge strength, followed by fasting insulin, triglycerides, and high-density lipoprotein cholesterol. Additional reproducible bridge biomarkers included free thyroxine, white blood cell count, 2-hour plasma glucose, absolute neutrophil count, androstenedione, and anti-thyroglobulin antibodies. The leading bridge biomarkers remained stable across bootstrap resampling, complete-case reconstruction, and alternative network specifications. 
Conclusions: PCOS is characterized by an integrated endocrine-metabolic network organized around a limited number of reproducible bridge biomarkers linking multiple physiological systems. Network analysis provides complementary systems-level information beyond conventional biomarker evaluation and may facilitate future biological phenotyping and precision medicine approaches in PCOS.
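
The reported network density can be checked with simple arithmetic: for an undirected graph, density is the number of edges divided by the number of possible node pairs. The short sketch below uses the node and edge counts from the abstract; the function name is an assumption for illustration.

```python
def network_density(n_nodes, n_edges):
    """Density of an undirected simple graph: edges / possible node pairs."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes")
    possible_edges = n_nodes * (n_nodes - 1) // 2
    if not 0 <= n_edges <= possible_edges:
        raise ValueError("edge count out of range")
    return n_edges / possible_edges


# 29 biomarkers and 73 edges, as stated in the abstract: 73 / 406.
density = network_density(29, 73)
print(f"{density:.2f}")  # prints 0.18
```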

18

Co-creating the Butterfly Multimedia Patient Education Platform for Thyroid Surgery Along the Patient Pathway: Lessons for Participatory Digital Patient Education

Moschofidou, M.; Sykiotis, G. P.

2026-06-29 endocrinology 10.64898/2026.06.25.26356545 medRxiv

Top 0.5%

0.3%

Objective: To describe and critically appraise the participatory design process underpinning a multimedia patient education platform for thyroid surgery in light of subsequent evaluation findings, in order to derive human-factors and methodological lessons for future co-creation. Methods: Using a multi-phase participatory design approach anchored in the Technology Acceptance Model and Mayer's Cognitive Theory of Multimedia Learning, including dual-channel processing and cognitive load principles, a multidisciplinary team including a patient representative co-created Medtronic's Butterfly platform for preoperative thyroid patients. Two Nominal Group Technique-based workshops mapped a six-stage surgical pathway, identified stage-specific information needs, and specified formats and sequencing for educational content. An external communication agency developed storyboards as precursors to webpages, animated videos, and leaflets, which underwent iterative advisory board review and content validity indexing before external mixed-methods evaluation in a cohort of thyroidectomy candidates. Results: The co-creation process yielded a modular platform comprising five informational webpages, five short animated videos, and three downloadable leaflets, each mapped to specific pathway stages, learning objectives, and information needs. The platform achieved high scores on validated measures of information quality, credibility, understandability, and actionability, and patients reported enhanced preparedness and decision-making support. However, formal readability assessments of webpages and video scripts showed that most materials exceeded recommended grade levels, usability issues in digital user experience (e.g., mobile responsiveness, navigation, subtitles) emerged only after implementation, and group-level stress, anxiety, and depression scores did not substantially improve despite perceived informational and emotional support. 
Conclusions: Participatory design facilitated the creation of a clinically credible, pathway-aligned educational platform that addresses key informational gaps in thyroid surgery care but also exposed boundaries of information-focused co-creation. Future projects should treat readability as a non-negotiable design constraint, extend co-creation to systematic usability testing of digital interfaces, and explicitly distinguish educational from psychosocial aims while co-designing complementary distress screening and referral pathways. The Butterfly platform offers a transferable model for co-creating high-quality digital patient education while highlighting the need to more fully center equity, accessibility, and broader patient support in participatory design.

19

CellDF: Quality-controlled cell matching for whole-slide HE-IHC label transfer

Jang, E.; Huh, Y.-M.

2026-06-24 pathology 10.64898/2026.06.18.733058 medRxiv

Top 0.5%

0.2%

Serial-section immunohistochemistry (IHC) is the largest available source of paired hematoxylin and eosin (HE) and IHC whole slide images, yet it remains underexploited for cell-level supervision: adjacent sections sample non-identical cells, and residual registration error prevents direct assignment of IHC labels to individual HE cells. We present CellDF (Cell Displacement Field), which turns registered serial-section data into pairs of HE cells and their IHC labels by solving cell matching at whole-slide scale and assessing its reliability without ground-truth correspondences. CellDF estimates a locally adaptive residual displacement field through iterated kernel regression over each HE cell's K nearest IHC candidates; a sparse-kernel variant keeps it tractable at the cell counts of a whole slide, where pairwise matchers are not. The within-tile distribution of the estimated displacements yields two ground-truth-free statistics, the directional scatter σθ and the between-tile angular deviation |Δθ|, that localize matching quality more finely than landmark-based target registration error and drive a two-stage outlier filter that withholds labels where matching is unreliable. On 54 same-section HyReCo pairs, σθ correlates only moderately with landmark error and flags localized restaining damage that global error misses; on 30 four-marker Acrobat serial-section cases, the same statistic flags which IHC marker, if any, lies physically close enough to HE to support cell-level transfer. As a proof of concept, IHC labels transferred through CellDF trained a cell classifier on HE embeddings that generalized to held-out cells within the sample (F1 0.85, AUROC 0.88), establishing serial-section IHC as a usable cell-level labeling resource. 

20

Protocol for standardized minimally invasive mouse models of bisphosphonate-related and radiation-induced jaw osteonecrosis

Ding, Z.; Zhang, J.; Liu, H.; Chandra, A.; Risbud, M. V.; Kusumbe, A. P.; Chen, J.

2026-07-03 pathology 10.64898/2026.06.28.735116 medRxiv

Top 0.5%

0.2%

This protocol describes a standardized and reproducible minimally invasive approach for establishing mouse models of bisphosphonate-related osteonecrosis of the jaw (BRONJ) and osteoradionecrosis of the jaw (ORNJ). The method combines a unified low-trauma oral surgical procedure with disease-specific injury induction strategies to generate robust and clinically relevant models of jaw osteonecrosis. For BRONJ, systemic zoledronic acid administration is coupled with mandibular first molar extraction using tape-assisted mouth opening and customized bent micro-forceps, minimizing soft tissue damage and reducing procedural variability. For ORNJ, a customized lead-shielding platform enables precise, noninvasive mandible-targeted irradiation, producing reproducible bone injury while limiting off-target radiation exposure. Together, these complementary models provide a consistent and minimally invasive framework for investigating jaw osteonecrosis arising from distinct etiologies. The protocol supports comprehensive downstream analyses, including micro-computed tomography, histology, and immunofluorescence, and facilitates mechanistic studies of disease pathogenesis, bone regeneration, and therapeutic intervention.